1 Register allocation for software pipelined multi-dimensional loops
Hongbo Rong, Alban Douillet, Guang R. Gao 

June 2005 ACM SIGPLAN Notices , Proceedings of the 2005 ACM SIGPLAN conference 
on Programming language design and implementation PLDI '05, Volume 40 
Issue 6 

Publisher: ACM Press 

Full text available: pdf(745.17 KB) Additional Information: full citation, abstract, references, index terms

Software pipelining of a multi-dimensional loop is an important optimization that overlaps 
the execution of successive outermost loop iterations to explore instruction-level 
parallelism from the entire n-dimensional iteration space. This paper investigates register 
allocation for software pipelined multi-dimensional loops. For single loop software 
pipelining, the lifetime instances of a loop variant in successive iterations of the loop form 
a repetitive pattern. An effective register allocation m ... 
2 Graph coloring register allocation for processors with multi-register operands
Brian R. Nickerson 

June 1990 ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 1990 conference 
on Programming language design and implementation PLDI '90, Volume 25 
Issue 6
Publisher: ACM Press 

Full text available: pdf(1.41 MB) Additional Information: full citation, abstract, references, citings, index terms

Though graph coloring algorithms have been shown to work well when applied to register 
allocation problems, the technique has not been generalized for processor architectures in 
which some instructions refer to individual operands that are comprised of multiple 
registers. This paper presents a suitable generalization. 

3 Register allocation for irregular architectures
Bernhard Scholz, Erik Eckstein 

June 2002 ACM SIGPLAN Notices , Proceedings of the joint conference on Languages, 
compilers and tools for embedded systems: software and compilers for 
embedded systems LCTES/SCOPES '02, Volume 37 issue 7 
Publisher: ACM Press

Full text available: pdf(220.34 KB) Additional Information: full citation, abstract, references, citings, index terms

For irregular architectures global register allocation is still a challenging problem that has 
not been successfully solved so far. The graph-coloring analogy of traditional approaches 
does not match the needs of register allocation for such architectures which feature non- 
orthogonal instruction sets and small register files. This work proposes a fundamentally 
new approach to global register allocation for irregular architectures. Our approach 
formulates global allocation as a partitioned boo ... 

Keywords: boolean quadratic problem, register allocation 



4 The evolution of the DECsystem 10

C. G. Bell, A. Kotok, T. N. Hastings, R. Hill 

January 1978 Communications of the ACM, volume 21 issue 1 

Publisher: ACM Press 

Full text available: pdf(1.92 MB) Additional Information: full citation, abstract, references, citings, index terms

The DECsystem 10, also known as the PDP-10, evolved from the PDP-6 (circa 1963) over 
five generations of implementations to presently include systems covering a price range of 
five to one. The origin and evolution of the hardware, operating system, and languages 
are described in terms of technological change, user requirements, and user 
developments. The PDP-10's contributions to computing technology include: accelerating 
the transition from batch oriented to time sharing computing systems; ... 

Keywords: architecture, computer structures, operating system, timesharing 



5 On the scheduling of variable latency functional units
Silvia M. Mueller

June 1999 Proceedings of the eleventh annual ACM symposium on Parallel
algorithms and architectures
Publisher: ACM Press 

Full text available: pdf(731.68 KB) Additional Information: full citation, references, index terms



6 HDL-based modeling of embedded processor behavior for retargetable compilation
Rainer Leupers

December 1998 Proceedings of the 11th international symposium on System
synthesis
Publisher: IEEE Computer Society 

Full text available: pdf(538.54 KB), Publisher Site Additional Information: full citation, references, index terms



7 A design verification methodology based on concurrent simulation and clock 
suppression 
Ernst Ulrich 

June 1983 Proceedings of the 20th conference on Design automation 
Publisher: IEEE Press 

Full text available: pdf(444.78 KB) Additional Information: full citation, abstract, references, citings, index terms

This paper outlines a methodology for design verification of very large networks based on
Concurrent Simulation and Clock Suppression. Concurrent Simulation is expected to yield
a 30:1 to 600:1 speed advantage over conventional (serial) simulation when roughly 
5,000 "good machines" are simulated concurrently. This speed advantage increases with 
the number of concurrent machines. Clock Suppression is an auxiliary technique to avoid 
simulatio ... 

8 Trace cache: a low latency approach to high bandwidth instruction fetching
Eric Rotenberg, Steve Bennett, James E. Smith 

December 1996 Proceedings of the 29th annual ACM/IEEE international symposium 
on Microarchitecture 

Publisher: IEEE Computer Society 

Full text available: pdf(1.38 MB) Additional Information: full citation, abstract, references, citings, index terms

As the issue width of superscalar processors is increased, instruction fetch bandwidth 
requirements will also increase. It will become necessary to fetch multiple basic blocks per 
cycle. Conventional instruction caches hinder this effort because long instruction 
sequences are not always in contiguous cache locations. We propose supplementing the 
conventional instruction cache with a trace cache. This structure caches traces of the 
dynamic instruction stream, so instructions that are otherwise no ... 

Keywords: instruction cache, instruction fetching, multiple branch prediction, superscalar 
processors, trace cache 



9 Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
Jeremy D. Frens, David S. Wise 

June 1997 ACM SIGPLAN Notices , Proceedings of the sixth ACM SIGPLAN symposium 
on Principles and practice of parallel programming PPOPP '97, Volume 32 
Issue 7 
Publisher: ACM Press 

Full text available: pdf(1.06 MB) Additional Information: full citation, abstract, references, citings, index terms

An elementary, machine-independent, recursive algorithm for matrix multiplication 
C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out 
faster than classically optimal code, tracking hand-coded BLAS3 routines. Proof of concept
is demonstrated by racing the in-place algorithm against manufacturer's hand-tuned
BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into
independent block-oriented processes, that each writes to a d ... 

Keywords: cache misses, indexing, paging, quadtrees, storage management, swapping 




10 Embedded applications: AES and the cryptonite crypto processor 
Dino Oliva, Rainer Buchty, Nevin Heintze 

October 2003 Proceedings of the 2003 international conference on Compilers,
architecture and synthesis for embedded systems
Publisher: ACM Press 

Full text available: pdf(346.09 KB) Additional Information: full citation, abstract, references, index terms

CRYPTONITE is a programmable processor tailored to the needs of crypto algorithms. The 
design of CRYPTONITE was based on an in-depth application analysis in which standard 
crypto algorithms (AES, DES, MD5, SHA-1, etc) were distilled down to their core 
functionality. We describe this methodology and use AES as a central example. Starting 
with a functional description of AES, we give a high level account of how to implement AES 
efficiently in hardware, and present several novel optimizations (whic ... 

11 Functional fault modeling and simulation for VLSI devices
Anil K. Gupta, James R. Armstrong 

June 1985 Proceedings of the 22nd ACM/IEEE conference on Design automation 

Publisher: ACM Press 

Full text available: pdf(701.13 KB) Additional Information: full citation, abstract, references, citings, index terms

Functional fault modeling and simulation for VLSI devices is described(*). A functional 
fault list is compiled using model perturbation and mapping of circuit defects into 
functional faults. A set of test vectors is then derived which detects all faults in the 
functional fault list. This same test vector set is then applied to a gate level model of the 
device. For the test case analyzed, a very high level of equivalent gate coverage was 
achieved. Conclusions are drawn a ... 

12 Contributed articles: Iota flow with direct local functions
Richard H. Oates 

March 1981 ACM SIGAPL APL Quote Quad, Volume 11 Issue 3
Publisher: ACM Press 

Full text available: pdf(908.31 KB) Additional Information: full citation, references, citings





13 Depth optimal sorting networks resistant to k passive faults 
Marek Piotrow 

January 1996 Proceedings of the seventh annual ACM-SIAM symposium on Discrete 
algorithms 

Publisher: Society for Industrial and Applied Mathematics 

Full text available: pdf(1.19 MB) Additional Information: full citation, references, citings, index terms



14 Automated proofs of object code for a widely used microprocessor 
Robert S. Boyer, Yuan Yu 

January 1996 Journal of the ACM (JACM), Volume 43 Issue 1
Publisher: ACM Press 

Full text available: pdf(2.41 MB) Additional Information: full citation, references, citings, index terms, review




Keywords: Ada, Boyer-Moore logic, C, Common Lisp, MC68xxx, Nqthm, automated 
reasoning, formal methods, machine code, mechanical theorem proving, object code, 
program proving, program verification 



15 Language support for Morton-order matrices 

David S. Wise, Jeremy D. Frens, Yuhong Gu, Gregory A. Alexander 
June 2001 ACM SIGPLAN Notices, Proceedings of the eighth ACM SIGPLAN symposium
on Principles and practices of parallel programming PPoPP '01, Volume 36
Issue 7
Publisher: ACM Press 

Full text available: pdf(372.85 KB) Additional Information: full citation, abstract, references, citings, index terms

The uniform representation of 2-dimensional arrays serially in Morton order (or {\eee} 
order) supports both their iterative scan with cartesian indices and their divide-and- 
conquer manipulation as quaternary trees. This data structure is important because it 
relaxes serious problems of locality and latency, and the tree helps to schedule multi- 
processing. Results here show how it facilitates algorithms that avoid cache misses and 
page faults at all levels in hierarchical memory, independent ... 

Keywords: paging, quadtrees 



16 The automatic indexing system AIR/PHYS - from research to applications
P. Biebricher, N. Fuhr, G. Lustig, M. Schwantner, G. Knorz 

May 1988 Proceedings of the 11th annual international ACM SIGIR conference on 
Research and development in information retrieval 

Publisher: ACM Press 

Full text available: pdf(785.93 KB) Additional Information: full citation, abstract, references, citings, index terms

Since October 1985, the automatic indexing system AIR/PHYS has been used in the input 
production of the physics data base of the Fachinformationszentrum Karlsruhe/West
Germany. The texts to be indexed are abstracts written in English. The system of 
descriptors is prescribed. For the application of the AIR/PHYS system a large-scale 
dictionary containing more than 600 000 word-descriptor relations resp. phrase-descriptor
relations has been developed. Most of these relations have been obtained ... 

17 A novel renaming mechanism that boosts software prefetching
Daniel Ortega, Mateo Valero, Eduard Ayguade 

June 2001 Proceedings of the 15th international conference on Supercomputing 
Publisher: ACM Press 

Full text available: pdf(214.40 KB) Additional Information: full citation, abstract, references, index terms

The detection and correct handling of data and control dependencies constitutes one of the 
biggest issues to expose ILP in current architectures. The ever increasing memory 
latencies and working space of programmes are making prefetching techniques crucial for 
the attainment of sustained high performance. Software prefetching allows the compiler to 
use information discovered at compile-time to effectively bring needed data before it is 
used, thus hiding all or part of the latency from main me ... 

18 Architectural considerations for a microprogrammable emulating engine using bit-slices
C. Halatsis, A. van Dam, J. Joosten, M. Letheren

May 1980 Proceedings of the 7th annual symposium on Computer Architecture 
Publisher: ACM Press 

Full text available: pdf(951.32 KB) Additional Information: full citation, abstract, references, index terms

This paper describes architectural considerations which led to the design of a fast 
programmable processor made from ECL bit-slices. The processor will be used as an on-line
data filtering engine for high energy physics experiments. Unlike prior designs of such
engines, the processor supports both user (horizontal) microcode and emulation of the 
PDP-11 fixed point instruction set (without memory management and multiple interrupt 
levels). In addition to an overview of the techniques used to ... 

19 Periodic, random-fault-tolerant correction networks
Marek Piotrow 

July 2001 Proceedings of the thirteenth annual ACM symposium on Parallel
algorithms and architectures

Publisher: ACM Press 

Full text available: pdf(173.65 KB) Additional Information: full citation, abstract, references, citings, index terms

We study the problem of sorting sequences of N-keys that can be obtained from sorted
ones by changing values of s, 0 < s &lne; N, keys at unknown positions. Such s-disturbed 
sequences can appear as outputs of a sorting network that contains faulty comparators. 
We present a simple comparator network of depth 4 that sorts 1-disturbed sequences in 
logarithmic time, where the network is used repeatedly, i.e. if its output is not sorted, the 
network is run aga ... 

Keywords: comparators, correction networks, fault-tolerant sorting, sorting networks 



20 Integrating superscalar processor components to implement register caching 
Matthew Postiff, David Greene, Steven Raasch, Trevor Mudge 

June 2001 Proceedings of the 15th international conference on Supercomputing 

Publisher: ACM Press 

Full text available: pdf(146.37 KB) Additional Information: full citation, abstract, references, citings, index terms

A large logical register file is important to allow effective compiler transformations or to 
provide a windowed space of registers to allow fast function calls. Unfortunately, a large 
logical register file can be slow, particularly in the context of a wide-issue processor which 
requires an even larger physical register file, and many read and write ports. Previous 
work has suggested that a register cache can be used to address this problem. This paper 
proposes a new register caching mechanism ... 